ABSTRACT

Product reviews have become an important source of information,
not only for customers to find opinions about products
easily and share their reviews with peers, but also for product
manufacturers to get feedback on their products. As the number of
product reviews grows, it becomes difficult for users to search and 
utilize these resources in an efficient way. In this work, we build a 
product review summarization system that can automatically pro- 
cess a large collection of reviews and aggregate them to generate 
a concise summary. A key drawback of existing product
summarization systems is that they cannot provide the underlying
reasons that justify users' opinions. Our method addresses this
problem by clustering sentences before selecting representative
candidates for summarization.

Categories and Subject Descriptors 

H.3.1 [Content Analysis and Indexing]: Abstracting methods;

I.2.7 [Natural Language Processing]: Text analysis

General Terms 

Algorithms, Experimentation, Languages, Performance 

Keywords 

Sentiment Analysis, Summarization, Clustering 

1. INTRODUCTION 

Product reviews are an important source of information. Customers
use them not only to find opinions about products, but also to vent
their frustrations and share successes with their peers. Reviews also
allow product manufacturers to receive feedback on their product
lines. Unfortunately, the number of reviews is overwhelming, making
it difficult to search and utilize the resource. A user may not manage
to read all relevant reviews for a product before needing to decide
whether to purchase it. The huge number of reviews also makes it
difficult for product manufacturers to keep track of customer opinions
of their products, e.g., what the public thinks of recently released
models, and which features they expect to be improved in the next
models.
To address these issues, we build a product review summarization 
system that can automatically process a large collection of reviews 
and aggregate information into a readable summary. Our system 
aims at achieving the following two important goals: (1) to employ 
an efficient way to automatically identify topics and subtopics in 
the reviews (product facet identification), and (2) to automatically
summarize the corresponding opinions and present a coherent summary
to users (summarization).

In (1) product facet identification, our approach first identifies 
frequent product dimensions being discussed in a review set. We 
show that the integration of a new heuristic using sentences' syn- 
tactic roles into one of the current state-of-the-art systems achieves 
better performance in precision. In (2) summarization, we imple- 
ment a clustering algorithm that identifies a group of sentences 
sharing the same subtopic, before analyzing their sentiment and 
producing the desired output summary. Unlike previous approaches,
the final summary is able to capture opinions from different
dimensions of the product. More importantly, it allows a potential
customer to quickly see how existing customers feel about the
product, while equipping him/her with sufficiently detailed information.

This report is an extended version of [21]; it elaborates more
on the approach used, and expands on the evaluation and analysis
of our prior results. In Section 2, we review related work on
sentiment analysis and summarization. In Section 3, we propose our
product review summarization system. In Section 4, we present the
experimental results for evaluating our proposed approaches. Finally,
we conclude the paper with a summary and directions for
future work in Section 5.

2. RELATED WORK 

We divide the related work on the task of summarizing prod- 
uct reviews into two sub-fields: discovering the users' opinions ex- 
pressed in the reviews (sentiment analysis), and aggregating and 
arranging them in an appropriate output (summarization). 



2.1 Sentiment Analysis 

Sentiment analysis refers to the computational treatment of sub- 
jectivity (whether there exists sentiment), the sentiment polarity 
(positive, negative, neutral or a scale of sentiment intensity), and 
the opinion content information (opinion holder, topic of opinion, 
etc.), that underlies a text span. The granularity of the text span 
starts at the level of individual words, then phrases, sentences, and 
finally the entire document. These levels of granularity also offer 
a natural way of characterizing the techniques developed in senti- 
ment analysis. However, we do not discuss work at the document 
level, as the target of our work is not to examine the overall senti- 
ment of the review, but the detailed (and thus finer grained) opin- 
ions within the review. 

At the word level, Hatzivassiloglou and McKeown [9] predicted
the binary semantic orientation of adjectives. They utilized textual
conjunctions (e.g., "and," "but") in a large training corpus between
the target adjective and a seed list of adjectives with manually
annotated polarity, achieving an accuracy of 82% on average.
Turney et al. [30] obtained comparable results with extended target
words including not only adjectives, but also nouns, verbs and
adverbs. Moreover, their system did not require a corpus as training
data. Instead, they approximated the point-wise mutual information
[5] between the target word and the positive word "excellent," and
between the target word and the negative word "poor," by counting
the number of results returned by Web searches matching queries that
join each pair of words by a NEAR operator. Since the scores
correspond to the similarity between the target word and each
positive/negative extreme, the polarity of that word can be
determined by taking the label with the higher score. More
recently, Hu and Liu [12] utilized WordNet [22] - a large lexical
database of English with synonym and antonym pointers - to grow
an initial seed list of adjectives with known orientation into a
larger list that covers all the remaining adjectives in WordNet.
Their system achieved higher results (accuracy of 84% on average)
than the two aforementioned systems, due to WordNet's stronger sense
of organization compared with the large text or Web corpora
used in the former two systems.
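
The seed-list expansion described above can be illustrated with a minimal sketch. It assumes that synonyms share the orientation of a word and antonyms take the opposite orientation; the dictionaries `synonyms` and `antonyms` are hypothetical toy stand-ins for WordNet's pointers, and the function name `expand_seeds` is ours, not taken from [12].

```python
from collections import deque


def expand_seeds(seeds, synonyms, antonyms):
    """Propagate +1/-1 orientation from seed adjectives to linked adjectives.

    seeds:    dict mapping an adjective to +1 (positive) or -1 (negative)
    synonyms: dict mapping an adjective to adjectives with the same orientation
    antonyms: dict mapping an adjective to adjectives with the opposite orientation
    """
    polarity = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        linked = [(s, 1) for s in synonyms.get(word, ())]
        linked += [(a, -1) for a in antonyms.get(word, ())]
        for other, sign in linked:
            if other not in polarity:  # keep the first orientation assigned
                polarity[other] = sign * polarity[word]
                queue.append(other)
    return polarity


# Toy data (hypothetical, not extracted from WordNet).
toy_synonyms = {"good": ["great"], "bad": ["awful"]}
toy_antonyms = {"good": ["bad"]}
expanded = expand_seeds({"good": 1}, toy_synonyms, toy_antonyms)
```

Adjectives that cannot be reached from any seed simply remain unlabeled in this sketch.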

The initial success of sentiment analysis at the word level
provides the necessary building blocks for studying larger units of
text, as shown in [31] and [3]. Both works established a positive
and statistically significant correlation between the presence of
adjectives and the subjectivity of sentences, as well as documents.
Furthermore, in determining the sentiment orientation of
a sentence, Yu and Hatzivassiloglou [35], and Kim and Hovy [14]
aggregated the polarity of each individual adjective or sentiment
word that appeared in the sentence itself. Following these works,
Wiebe and Riloff [32], Wilson et al. [33], and Kim and Hovy [15]
introduced additional sentence-surface features (e.g., counts of
positive/negative adjectives in a target sentence, or in a window of
previous and next sentences; a binary feature on whether the sentence
contains a pronoun, etc.) in a supervised manner, and achieved
fairly good results (up to an accuracy of 70%) in the same task.

Nevertheless, in the domain of product reviews, finding the
orientation of the sentence is generally not enough. In fact, it is
necessary to identify the semantics of the opinion in the sentence, as
the opinion holder may describe a particular facet of the subject
in the review that users may be interested in. Typical examples
of facets that belong to a camera product would be: battery life,
lens, flash system, price, and so on. In the case of a music player,
the facets are: sound system, battery life, weight/size, storage
capability, and so on. Hu and Liu [12] addressed this problem by
first applying data mining techniques to extract facets of the
product, then classifying the orientation of each sentence in which
the facets appear as positive or negative using WordNet. Their
system achieved a promising accuracy of 72% in identifying product
facets, and 84% in predicting facet orientation. Subsequently,
Popescu and Etzioni [23] introduced the use of the relaxation
labeling technique [13] in their OPINE system to determine facet
orientation, and achieved an accuracy of 78%. They identify the
neighbors of a target facet within the same sentence
based on surface linguistic connective cues, such as conjunctions
and disjunctions. More recently, Ding et al. [6] proposed a
state-of-the-art system that further incorporated a set of complex,
carefully built grammar rules between adjacent sentence constructions
as well as neighboring facets, together with a collection of
comprehensive polarity-annotated lists of idioms, nouns, verbs,
adjectives and adverbs, to solve the same problem. The system achieved
an accuracy of 92%, closely matching the upper bound of the
performance of human perception.

While the work on sentiment analysis discussed above focuses
on discovering the users' opinions in the reviews, few systems have
managed to aggregate these opinions. In recent work, Sauper
et al. [26] proposed an integrated approach that jointly learns
product facets and user sentiments for product reviews using Bayesian
topic models. Another approach to this problem is to view the
aggregation task as a summarization task, which we review next.

2.2 Summarization 

In the early stages of opinion summarization, Turney et al. [30]
produced a thumbs-up/thumbs-down indication for movie reviews
as the output of their orientation classification component. The movie
itself was treated as a single entity of interest. Refining this to
cater to the detailed characteristics of products, Hu and Liu [12],
and Popescu and Etzioni [23] focused on product facets - distinctive
features of the product that users often comment upon
- and generated facet-driven summaries, supported with sentence-level
statistics, i.e., the number of positive/negative sentences that
mention the facet. Subsequently, Liu et al. [19] extended the
single facet-driven summary into a comparative summary between
many products, where the orientations of all shared facets are
plotted together with their number of supporting sentences for
visualization. However, while users may prefer these systems for an
at-a-glance presentation of products, they provide only shallow
information. In such systems, users can learn how
many people prefer or dislike a facet, but the systems do not
explicitly help users organize the (shared) underlying reasons for
these opinions.

Multi-document summarization techniques are more relevant, since
the task does not address a single review but a set of reviews.
The main characteristic of multi-document summarization is both
leveraging and cleaning up the inherent redundancy of the input,
where similar information often appears across different sources.
DeJong [7] as well as Radev and McKeown [25] applied information
extraction techniques to gather information from different
sources, and generated summaries by filling the extracted
information into predefined sentence templates. However, their
frameworks require significant background knowledge in order to
create the detailed templates at a suitable level, which results
in domain-dependent systems. Barzilay et al. [2] proposed a novel
approach that does not depend on domain-specific knowledge. In
their system, each sentence is first transformed into a
predicate-argument structure called a DSYNT tree [16], with the nodes
being the sentence constituents. Under this representation, grammar
dependencies between sentence constituents (subject-verb relation,
adjective-noun relation, etc.) are captured and essentially
abstracted from their ordering in the sentence. Therefore, with the
assistance of a set of paraphrasing rules that are capable of
recognizing identical or similar predicates, they were able to derive
rules to combine similar DSYNT trees of sentences from different
sources. The resulting tree is fed to a final sentence generation
component to formulate a new sentence. Carbonell and Goldstein's
maximum marginal relevance (MMR) [4] is another widely
used technique in multi-document summarization; for example, Ye
et al. [34] leveraged MMR to solve their summarization task on
general news and obtained reasonable results. In detail, MMR is an
iterative algorithm that selects one sentence from the collection per
round to insert into the final summary, such that: (1) the selected
sentence covers the most new information mentioned by the
remaining unselected sentences, and (2) the selected sentence
has minimum similarity with all previously selected sentences in
the summary. The algorithm terminates either when a fixed number
of sentences is selected, or when the content overlap between
any candidate sentence and the summary at that iteration exceeds a
predefined threshold.
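
To make the selection loop concrete, the following is a minimal, query-free sketch of an MMR-style greedy selection. Several choices are our own assumptions rather than details from [4] or [34]: the Jaccard word-overlap similarity, the use of mean similarity to the other sentences as the "coverage" term, the trade-off weight `lam`, and termination after a fixed number `k` of sentences.

```python
def jaccard(a, b):
    """Word-overlap similarity between two token sets (a stand-in metric)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def mmr_select(sentences, k=2, lam=0.5):
    """Greedily pick up to k sentences, trading coverage against redundancy."""
    tokens = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    # Coverage: mean similarity of a sentence to every other sentence.
    coverage = [
        sum(jaccard(tokens[i], tokens[j]) for j in range(n) if j != i) / max(n - 1, 1)
        for i in range(n)
    ]
    selected = []
    while len(selected) < min(k, n):
        best, best_score = None, float("-inf")
        for i in range(n):
            if i in selected:
                continue
            # Redundancy: highest similarity to any sentence already chosen.
            redundancy = max(
                (jaccard(tokens[i], tokens[j]) for j in selected), default=0.0
            )
            score = lam * coverage[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [sentences[i] for i in selected]


reviews = [
    "the battery lasts long",
    "the battery lasts very long",
    "the lens is sharp",
]
summary = mmr_select(reviews, k=2, lam=0.5)
```

With these toy sentences, the second pick skips the near-duplicate battery sentence in favor of the lens sentence, which is the redundancy-avoiding behavior described above.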

2.3 Shortcomings of Related Work 

As described in Section 2.1, there exist two systems, [12] and
[23], that addressed the problem of product facet identification.
However, these systems only analyze users' opinions in the review and
do not summarize these opinions. Furthermore, it is not clear how
they constructed queries that combine a set of cue words associated
with the product class (e.g., "of camera," "camera has," "camera
comes with," and so on) with the candidate facet. Our own
early experiments with different query combinations also did not
produce results consistent with their systems. In recent work, Titov
and McDonald [28] proposed a joint statistical model to find the
set of relevant facets for a rated entity and extract all textual
mentions that are associated with each facet. However, they focus on
finding the set of facets and do not tackle summarization.

We can see that the works of [30, 12, 23, 19] focus on
sentiment analysis rather than summary generation, and do not
address the problem of extracting the underlying reasons for an
opinion. To solve this problem, in this paper, we apply summarization
techniques to produce user-friendly product review output.
Multi-document summarization techniques [7, 25, 2], which were
previously applied to news, have yet to be adapted to the domain of
product reviews. Product reviews differ from news articles in that they
may not be grammatically well-formed and, crucially, involve
sentiment analysis. In [34], the applied MMR variant requires a metric
to compute the content similarity between any two sentences, but
in our domain of product reviews, which exhibits both
content and sentiment information, it is difficult to define an
appropriate metric.

To the best of our knowledge, there are no systems that combine
sentiment analysis with summarization techniques to generate
product review summaries. Therefore, we have constructed a
system that incorporates the results from both sentiment analysis
and summarization, aiming to fuse the advantages of both tasks.

3. PROPOSED METHOD 
3.1 Motivation 

In order to justify the need to discover the underlying reasons in
users' opinions, we first compare the outputs of existing
product review summarization systems, such as Google Product¹,
Bing Shopping², and Hu's system [11], with the output that we aim to
produce in our system.

Figure 1 shows two summaries, one that represents the existing
systems, as well as one that represents our target output. Both
summaries are structured naturally based on product facets. However,
the summary in Figure 1(a) provides only the total number of
positive and negative sentences ((+) and (-), respectively) for each

¹ http://www.google.com/products/, as of 2010
² http://www.bing.com/shopping/, also as of 2010
a. Lens

(+): 57 sentences
  1. The lens feels very solid!
  2. I have taken a whole bunch of excellent pictures with this lens.

(-): 15 sentences
  1. I do not satisfy with the included lens kit.
  2. The lens cap is very loose and come off very easily!

b. Battery life

(+): 32 sentences
  1. The battery lasts forever on one single charge.
  2. The battery duration is amazing!

(-): 8 sentences
  1. I experienced very short battery life from this camera.
  2. It uses a heavy battery.

(a) Output summary produced by existing systems.

a. Lens

{ (+) The lens feels very solid! (+10 similar)
  (-) I think the lens does not worth it, it's a bit too fragile. (+2 similar)

{ (+) I have taken a lot of excellent pictures with this lens. (+7 similar)
  (-) Don't buy this lens, I always get my pictures blurred. (+0 similar)

b. Battery life

{ (+) The battery lasts forever on one single charge. (+18 similar)
  (-) I experienced very short battery life from this camera. (+4 similar)

{ (+) sentence
  (-) It uses a heavy battery. (+3 similar)

(b) Desired output summary proposed by us.

Figure 1: Comparison of summaries obtained from (a) existing,
and (b) our proposed systems.



facet, and there is no attempt to organize the sentences shown
below the number. Users still need to review the (possibly
numerous) individual sentences to discover the actual set of
reasons that justify the given sentiment. Therefore, such output does
not satisfy the ultimate purpose of a summary. To address this, as
illustrated in Figure 1(b), a summary that provides the reasoning
behind the likes and dislikes is preferable, as it makes such
information explicit.

The reader may object that the proposed summary is similar
to Figure 1(a) in structure, simply with an additional level of
subtopics. We point out that Figure 1(b) is not just a finer
grained version of Figure 1(a). The grouping of subtopics provides
a good form of reasoning and indicates to users which facets
(e.g., lens, battery life) are liked/disliked.

3.2 System Overview 

Figure 2 shows an overview of our product review summarization
system. Our system consists of two main components:
(1) product facet identification, and (2) summarization.

Figure 2: System overview.

Aside from the text of the review itself, a review may also
feature additional information such as date, time, title, author name
and star-based ratings. For inputs to the (1) product facet
identification component, we do not use any of these information
sources, relying on the text body alone, so that our approach is as
general as possible. We first preprocess the sentences with a
Part-of-Speech (POS) tagger to obtain the POS label for each word. In
the next step, only those words labeled as Noun or Adjective -
being part of noun phrases - are collected and fed to the
association mining module, which generates a list of candidate frequent
product facets. This is followed by some postprocessing
operations to remove redundant results. Next, all
the adjectives associated with those frequent facets in the sentences
are gathered, and used as a means to look up infrequent
facets. Finally, opinionated sentences that contain a product facet
are extracted.

In the (2) summarization component, the inputs are groups of
sentences that belong to each of the product facets obtained from the
(1) product facet identification component. We preprocess this list
of facets to identify and remove insignificant facets. In the next
step, we consider the sentences under each facet independently
of the others. Each group of sentences is sent to the subtopic
clustering module. This clustering module first defines a "sentence
representation" based on the similarity between any two sentences,
and then combines similar sentences to generate clusters. The
output from this module is fed to the "compact presentation" module,
which applies sentiment analysis and summarization techniques to
generate the final summary.

3.3 Product Facet Identification 

3.3.1 Assumptions 

It is important to note that we follow the same assumption
described in [11], and consider only product facets that appear
as nouns or noun phrases; our method therefore has the limitation
that it cannot handle implicit facets that are not explicitly
mentioned. To explain this crucial point, consider the following two
sentences from camera reviews:

(1) The pictures of this camera are very clear. 

(2) The camera fits nicely into my palm. 

In sentence (1), the user expresses his/her satisfaction with the
quality of the pictures taken by the camera, and we can infer that
the noun picture is a facet of the camera. On the other hand,
sentence (2) discusses the size of the camera. However, the word
size does not appear explicitly in the sentence. In order to identify
implicit product facets, we need deep semantic understanding of
the domain, which implies that we have to rely on algorithms that
have semantic knowledge of words; this remains difficult with
present technology. Fortunately, explicit facets appear more often in
the reviews than implicit ones. In our implementation, we consider
a span of continuous words as a noun phrase when its rightmost
word is a noun and the rest of the phrase is composed of nouns or
adjectives (e.g., battery life, external flash).
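
The noun-phrase rule above can be sketched as a scan over POS-tagged tokens. This is an illustrative sketch only: it assumes Penn Treebank tags (NN* for nouns, JJ* for adjectives) and takes maximal runs of such tokens; the function name `noun_phrases` is ours.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}


def noun_phrases(tagged):
    """Return maximal noun/adjective runs whose rightmost word is a noun."""
    phrases, run = [], []
    # A sentinel token with a non-matching tag flushes the final run.
    for word, tag in list(tagged) + [("", "END")]:
        if tag in NOUN_TAGS or tag in ADJ_TAGS:
            run.append((word, tag))
            continue
        # Drop trailing adjectives so that the phrase ends in a noun.
        while run and run[-1][1] not in NOUN_TAGS:
            run.pop()
        if run:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases


tagged_sentence = [
    ("I", "PRP"), ("recommend", "VB"), ("this", "DT"), ("camera", "NN"),
    ("for", "IN"), ("excellent", "JJ"), ("picture", "NN"), ("quality", "NN"),
    (".", "."),
]
candidates = noun_phrases(tagged_sentence)
```

Single nouns such as "camera" are returned as one-word phrases, so the same routine yields both noun and noun-phrase candidates.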

3.3.2 Preprocessing 
Part-of-Speech Tagging 

We utilize the Stanford POS Tagger³ [29] to process each input
sentence and yield the part-of-speech (POS) label for each word.
We observe that the tagger performs fairly well at identifying the
correct label for nouns and noun phrases, even though there are a
number of oddly-structured sentences present in the reviews. We
discard stopwords from the tagging results, and convert the remaining
nouns and noun phrases to their stemmed versions
using the Porter stemmer⁴ [24]. The following shows the sentence "I
recommend this camera for excellent picture quality" with its POS
tags (NN and JJ are the labels for noun and adjective, respectively):

I/PRP recommend/VB this/DT camera/NN for/IN
excellent/JJ picture/NN quality/NN .

Syntactic Roles 

We need to further refine the precision of our module by
filtering away noisy results. For instance, the following
words are all accepted as candidate product facets when we
process a set of camera reviews: "light," "hand," "time," "month,"
"hour," and so on. These nouns appear often in the reviews, so
they are not pruned by any of the statistical criteria employed in Hu
and Liu's system [11]. Therefore, we introduce the use of syntactic
roles within a sentence as a feature to help distinguish genuine
product facets from such noisy ones. Consider the following
sentences parsed by the Stanford Dependency Parser⁵ [17]:



³ http://nlp.stanford.edu/software/tagger.shtml
⁴ http://www.tartarus.org/~martin/PorterStemmer/
⁵ http://nlp.stanford.edu/software/lex-parser.shtml



(1) The larger lens of the g3 gives better picture quality in low 
light. 

. . . , nsubj(gives-7, lens-3), . . . , dobj(gives-7, quality-10), . . . 

(2) When I took outdoor photos with plenty of light, the photos 
were awesome. 

. . . , dobj(took-3, photos-5), . . . , nsubj(awesome-14, photos-12), . . .

(3) My fiance just did not like the size, it is so small in her hand. 

. . . , dobj(like-6, size-8), . . . 

According to the examples above, we observe that genuine facets 
tend to appear as either subjects or objects within the sentences. In 
fact, our analysis on a subset of camera reviews (more than 300 sen- 
tences that contain some facets over 24 reviews) shows that more 
than 90% of the instances correspond to the above observation. 

This is not too surprising, as subjects and objects in the
sentences are usually the targets at which the users express their
opinions. These findings suggest that we can filter non-subject and
non-object nouns and noun phrases from the set of identified candidate
facets. Compared with the processing pipeline in Hu and Liu's
system [11], we introduce our own heuristic during the preprocessing
step so that only legitimate nouns and noun phrases are
delivered to the association mining step and to the infrequent
facet extraction step, where the system additionally does not extract
nouns or noun phrases that appear fewer than a certain number of times.
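
The subject/object heuristic can be sketched as a simple filter over dependency triples. This is a sketch under assumptions: the parser output is assumed to be already converted into `(relation, head, dependent)` tuples, only the `nsubj` and `dobj` relations from the examples above are treated as subject/object roles, and the third triple in the toy input is a made-up illustration of a non-subject, non-object relation.

```python
# Relations treated as "subject or object" roles (an assumption of this sketch).
ROLE_RELATIONS = {"nsubj", "dobj"}


def filter_by_role(candidates, dependencies):
    """Keep candidate nouns that occur as a subject or object dependent."""
    in_role = {dep for rel, _head, dep in dependencies if rel in ROLE_RELATIONS}
    return [c for c in candidates if c in in_role]


# Triples for sentence (1); the "prep_in" triple is hypothetical.
deps = [
    ("nsubj", "gives", "lens"),
    ("dobj", "gives", "quality"),
    ("prep_in", "gives", "light"),
]
kept = filter_by_role(["lens", "quality", "light"], deps)
```

Here "light" is dropped because it never fills a subject or object role, while "lens" and "quality" are kept.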

3.3.3 Association Rule Mining

In this component, we use the association rule mining technique [1]
to statistically identify all the frequent explicit product facets.
Before we draw the relation between association rule mining and our
domain of interest, we outline the general description of this
technique as follows:

Items:

An item is the smallest entity being considered in a particular
domain of interest. An itemset is a set of items, and the set of all
items is denoted as $I$.

Transaction:

Transaction $t$ contains itemset $X$ if $X \subseteq t$. The set of
all transactions is denoted as $D$.

Association Rule:

$X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$ and $X \cap Y = \emptyset$.

Support:

$\mathrm{supp}(X)$ is the number of transactions in $D$ that contain
itemset $X$. If applied to a rule,
$\mathrm{supp}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y)$.

Confidence:

$\mathrm{conf}(X \Rightarrow Y)$ is the fraction of the transactions in
$D$ containing itemset $X$ that also contain itemset $Y$, i.e.,
$\mathrm{conf}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y) / \mathrm{supp}(X)$.

The mining of association rules is then defined as generating all
possible rules that have support and confidence greater than the
user-defined minimum values. The Apriori algorithm [1] solves
this using the following two phases: (i) identify all frequent
itemsets that satisfy the minimum support, and (ii) generate rules
from those discovered frequent itemsets that satisfy the minimum
confidence.

When we apply this algorithm to our approach, the items are the 
nouns and noun phrases extracted from the "Preprocessing" step 
and the transactions are the sentences containing those nouns and 



noun phrases. We only need to run the first phase of the Apriori 
solution in order to obtain the set of frequent itemsets, or equiv- 
alently the set of candidate frequent product facets. At the same 
time, we also conveniently obtain the ranking for this set of can- 
didate frequent product facets based on their support values. This 
ranking is an important aspect that we utilize in the downstream 
summarization module when presenting information to the users. 
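
The first Apriori phase, applied to sentences as transactions, can be sketched as follows. This is a compact illustrative implementation rather than the authors' code: support is recomputed by scanning all transactions, which is adequate for small inputs, and the names `frequent_itemsets` and `min_support` are our own.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Phase (i) of Apriori: map each frequent itemset to its support count."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items}
    level = {s for s in level if support(s) >= min_support}
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and support(c) >= min_support
        }
    return frequent


# Each transaction holds the candidate nouns found in one sentence (toy data).
sentences_as_items = [
    {"battery", "life"},
    {"battery", "life", "lens"},
    {"lens"},
    {"battery"},
]
facet_support = frequent_itemsets(sentences_as_items, min_support=2)
ranking = sorted(facet_support.items(), key=lambda kv: -kv[1])
```

Sorting by support, as in `ranking`, gives the ordering of candidate facets that the downstream summarization module relies on.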

3.3.4 Postprocessing 

As we consider a large portion of the nouns and noun phrases
that appear in the reviews, not all are genuine facets; i.e., some of
them are uninteresting or redundant. Therefore, a postprocessing step
removes those irrelevant facets by applying the following rules:

Usefulness Pruning 

This criterion focuses on removing single-word facets that are
likely to be meaningless. For example, in the context of camera
reviews, life itself is not a useful facet, while battery life is a
meaningful facet. We can solve this problem by computing the pure
support of a facet f, which is defined as the number of sentences
in which f appears alone, without being subsumed by any other facet.
If this number is below a predefined threshold, there is strong
evidence that we can just keep the superset of f as the useful facet.
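
A minimal sketch of the pure-support computation follows. It assumes whitespace-tokenized, lowercased sentences and treats a facet phrase as "subsuming" a single-word facet when the phrase contains that word; the function name `pure_support` and the toy sentences are ours.

```python
def pure_support(facet, sentences, facets):
    """Count sentences where a single-word facet occurs outside any facet phrase."""
    # Facet phrases that contain the facet as one of their words.
    supersets = [g for g in facets if g != facet and facet in g.split()]
    count = 0
    for sentence in sentences:
        words = sentence.lower().split()
        text = " " + " ".join(words) + " "
        subsumed = any(" " + g + " " in text for g in supersets)
        if facet in words and not subsumed:
            count += 1
    return count


toy_sentences = [
    "the battery life is great",
    "life is short",
    "battery life could be better",
]
toy_facets = ["life", "battery life"]
life_pure_support = pure_support("life", toy_sentences, toy_facets)
```

In this toy example "life" has a pure support of 1, so with any threshold above 1 only "battery life" would be kept.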

Compactness Pruning 

This criterion targets redundant facet phrases - noun phrases that
are discovered as facets. For example, photo pixel and sample image
are not as compact as pixel and image. For each word that the
phrase contains, we compute the ratio between the support of the
phrase and the support of that individual word. If any of these ratios
is less than a predefined threshold, we prune the facet phrase.
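
The ratio test can be written in a few lines. In this sketch the support counts are hypothetical numbers chosen for illustration, and `min_ratio` stands for the predefined threshold, whose actual value is not stated here.

```python
def is_compact(phrase, support, min_ratio):
    """True if the phrase's support is a large enough share of each word's support."""
    return all(
        support[phrase] / support[word] >= min_ratio
        for word in phrase.split()
    )


# Hypothetical support counts (number of sentences containing each item).
toy_support = {
    "photo pixel": 2, "photo": 40, "pixel": 25,
    "battery life": 30, "battery": 35, "life": 32,
}
keep_photo_pixel = is_compact("photo pixel", toy_support, min_ratio=0.5)
keep_battery_life = is_compact("battery life", toy_support, min_ratio=0.5)
```

With these counts "photo pixel" is pruned (2/40 is far below the threshold), while "battery life" survives because it accounts for most occurrences of both of its words.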

3.3.5 Infrequent Facet Extraction

As stated thus far, association mining is not able to discover
infrequent product facets, as they have fairly low support values.
However, users tend to use similar opinion words across different
product facets. To illustrate this fact, let us examine the following
two sentences:

(1) The camera takes absolutely amazing pictures.

(2) The accompanied software is amazing.

In Sentence (1), picture is a frequent facet that has been identified
by our association mining module, while software in Sentence (2)
is an infrequent one, and thus rejected by frequency. On the other
hand, we observe that they share the adjective amazing.
Hence, our heuristic works in the following two steps: (i) gather all
opinion words that modify frequent facets; (ii) if a sentence
contains an infrequent facet candidate that is modified by one or more
of the opinion words from (i), the nearest noun or noun phrase
is included as a facet. In this way, we can recover "software" as a
product facet.
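
The two steps can be sketched over POS-tagged sentences. This sketch makes simplifying assumptions that are not part of the method description: an adjective is taken to "modify" a facet whenever the two co-occur in a sentence, only single nouns (not noun phrases) are recovered, and stemming is skipped, so the frequent facet is written as "pictures".

```python
def opinion_words(tagged_sentences, frequent_facets):
    """Step (i): collect adjectives from sentences that mention a frequent facet."""
    words = set()
    for sentence in tagged_sentences:
        nouns = {w for w, tag in sentence if tag.startswith("NN")}
        if nouns & frequent_facets:
            words.update(w for w, tag in sentence if tag.startswith("JJ"))
    return words


def infrequent_facets(tagged_sentences, frequent_facets):
    """Step (ii): take the noun nearest to a known opinion word as a new facet."""
    opinions = opinion_words(tagged_sentences, frequent_facets)
    found = set()
    for sentence in tagged_sentences:
        nouns = [(i, w) for i, (w, tag) in enumerate(sentence) if tag.startswith("NN")]
        if not nouns:
            continue
        for i, (w, tag) in enumerate(sentence):
            if tag.startswith("JJ") and w in opinions:
                _, nearest = min(nouns, key=lambda pair: abs(pair[0] - i))
                if nearest not in frequent_facets:
                    found.add(nearest)
    return found


# Hand-tagged versions of sentences (1) and (2).
tagged = [
    [("The", "DT"), ("camera", "NN"), ("takes", "VBZ"), ("absolutely", "RB"),
     ("amazing", "JJ"), ("pictures", "NNS")],
    [("The", "DT"), ("accompanied", "VBN"), ("software", "NN"), ("is", "VBZ"),
     ("amazing", "JJ")],
]
recovered = infrequent_facets(tagged, frequent_facets={"pictures"})
```

A dependency parse would give a more faithful notion of "modifies" than the co-occurrence and token-distance shortcuts used here.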

3.4 Summarization 

3.4.1 Opinionated Sentence Extraction

Sentences that contain any of the product facets that we have dis- 
covered are labeled with that corresponding facet. A sentence can 
be assigned to more than one facet, as that sentence may discuss a 
relation between many facets. The following instances show sen- 
tences being labeled with one and two product facets respectively: 

(1) The lens blocks the viewfinder when the lens is set to wide 
angle. 

(2) The 10 megapixels produces really sharp pictures. 



It is important to note that we do not feed all labeled sentences into
the summarization component. We choose opinionated sentences
only, since we place greater emphasis on summarizing users'
opinions in this work. In order to achieve this, we apply sentiment
analysis to filter the labeled sentences based on the
approach proposed in [6]: we first prepare a seed list of adjectives
with known polarity, and then use the synonym/antonym pointers in
WordNet to cover the remaining adjectives of unknown polarity. The
sentence polarity is then determined as the summation of all
subjectivity scores of the adjectives in the sentence. If the
resulting summation score is positive (negative), the sentence is
classified as positive (negative).
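
The summation rule can be sketched as below. The score dictionary is a hypothetical example, adjectives missing from it contribute zero, and labeling a zero total as "neutral" (i.e., not opinionated) is our assumption, since the text only specifies the positive and negative cases.

```python
def sentence_polarity(adjectives, scores):
    """Classify a sentence from the summed scores of its adjectives."""
    total = sum(scores.get(adj, 0) for adj in adjectives)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"  # assumption: treated as non-opinionated and filtered out


# Hypothetical adjective scores (+1 positive, -1 negative).
toy_scores = {"amazing": 1, "solid": 1, "loose": -1, "heavy": -1}
label_a = sentence_polarity(["amazing"], toy_scores)
label_b = sentence_polarity(["loose", "heavy", "solid"], toy_scores)
```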

Similarity Pruning 

Users can also employ synonyms to mention the same facet, for
example: picture versus image or photo; or screen versus monitor.
However, these are treated as different genuine facets in Hu and
Liu's system [11]. If we follow this definition, different pieces of
summary for the same facet will be produced, which is not
desirable. To solve this problem, we apply Kong et al.'s word
semantic similarity measure [20] to compute the similarity between any
two candidate facets. If the score is greater than a predefined
threshold, the two words (and hence their corresponding sentences)
are combined.

Kong et al. [20] constructed an edge-counting based model that 
considers the depth of the least common subsumer and the shortest 
path length between any two words in WordNet. Formally, given two 
words $w_1$ and $w_2$, the semantic similarity $s_w(w_1, w_2)$ is 
defined by Equation (1): 

\[ s_w(w_1, w_2) = \frac{f(d)}{f(d) + f(l)}, \tag{1} \]

where $l$ is the length of the shortest path between $w_1$ and $w_2$, 
$d$ is the depth of the least common subsumer in the WordNet 
hierarchical semantic net, and $f(x)$ denotes the transfer function 
for $d$ and $l$. The value of $s_w(w_1, w_2)$ lies in the interval 
[0, 1], with 1 for the maximum similarity and 0 for no similarity at 
all. We follow the experimental results shown in [20] and choose 
$f(x) = e^x - 1$. The resulting formula is: 

\[ s_w(w_1, w_2) = \frac{e^{\alpha d} - 1}{e^{\alpha d} + e^{\beta l} - 2} \quad (0 < \alpha, \beta < 1), \tag{2} \]

where $\alpha$ and $\beta$ are smoothing factors. As reported in [20], 
the optimal values of $\alpha$ and $\beta$ are both 0.25. We also use 
these optimal values in our experiments. 
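
As a worked illustration of the measure, the sketch below evaluates the similarity for a given subsumer depth and path length. It assumes the exponential transfer function $f(x) = e^x - 1$ applied to $\alpha d$ and $\beta l$, takes $d$ and $l$ as plain numbers rather than querying WordNet, and treats a zero path length as identity; the function names are hypothetical.

```python
import math


def transfer(x: float) -> float:
    """Transfer function f(x) = e^x - 1 (assumed form)."""
    return math.exp(x) - 1.0


def word_similarity(depth: float, path_length: float,
                    alpha: float = 0.25, beta: float = 0.25) -> float:
    """Edge-counting similarity from subsumer depth d and path length l."""
    if path_length == 0:
        # Same node in the hierarchy: treated as maximally similar (assumption).
        return 1.0
    numerator = transfer(alpha * depth)
    denominator = numerator + transfer(beta * path_length)
    return numerator / denominator


# A deep common subsumer with a short path scores higher than a shallow
# subsumer reached through a long path.
close_pair = word_similarity(depth=8, path_length=2)
distant_pair = word_similarity(depth=2, path_length=8)
```

The score grows with the depth of the least common subsumer and shrinks as the path between the two words gets longer, which is the behaviour the edge-counting model is meant to capture.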

Sentence Representation and Similarity Measurement 

After identifying product facets, sentences are analyzed to 
determine their subjectivity. To facilitate the subsequent clustering 
algorithm, we adopt a simple yet novel sentence representation, 
together with the sentence similarity measurement scheme proposed 
in [18], which yields state-of-the-art results. At a high level, the 
algorithm uses a dynamic vector representation whose size adapts to 
the sentences being compared, and computes the cosine similarity 
between the two sentence vectors. 

The algorithm starts by identifying "concepts" in the sentence 
[34]. Concepts are defined as the open-class words (nouns, verbs, 
adjectives, and adverbs, excluding stopwords) in the sentence. We 
additionally apply the restriction on syntactic roles described in 
Section 3.3.2, so that we only include words that hold subject and 
object roles in the sentence. In detail, we extract important nouns 
that are subjects or objects, the main verbs/adjectives associated 
with those nouns, and the adverbs that modify those main 
verbs/adjectives. Then, given two sentences for which we want to 
compute similarity, $s_1$ with the set of concepts $C_1$ and $s_2$ 
with the set of concepts $C_2$, we define a joint concept vector 
$C = C_1 \cup C_2$. In the next step, $V_i$, the vector 
representation for $s_i$ ($i = 1, 2$), is created with size equal to 
that of $C$; its values are determined by the following rules: 

At index $k$, 

• If $s_i$ contains $C[k]$, the concept at the $k$-th index in the 
joint vector, $V_i[k]$ is set to 1.0. 

• If $s_i$ does not contain $C[k]$, a semantic similarity score is 
computed between $C[k]$ and every concept in that sentence. 
$V_i[k]$ is then set to the highest similarity score. We apply the 
same Equation (2) to compute this similarity. 

The semantic similarity between two sentences $s_1$ and $s_2$ can 
now be measured by the cosine similarity between their representative 
vectors $V_1$ and $V_2$, which results in a score within the range 
[0, 1]. This similarity is defined by Equation (3): 

\[ sim(s_1, s_2) = \frac{V_1 \cdot V_2}{\|V_1\| \cdot \|V_2\|}. \tag{3} \]

Figure 3 shows an example of the above steps for clarification. 

Assume the following two sentences, whose concepts are listed in 
the joint vector $C$ below: 

$s_1$ = The battery of this camera is very impressive. 
$s_2$ = Canon camera always has a long battery life. 

Therefore, the joint vector is denoted as follows: 

$C$ = {battery, camera, impressive, has, long, life} 

The resulting sentence vectors $V_1$ and $V_2$ are as follows: 

$V_1$ = {1.0, 1.0, 1.0, 0.0, 0.3, 0.15} 
$V_2$ = {1.0, 1.0, 0.3, 1.0, 1.0, 1.0} 

The semantic similarity between the two sentences $s_1$ and $s_2$ 
is computed as follows: 

$sim(s_1, s_2)$ = 0.69 

Figure 3: Example of sentences together with their vector 
representation. 
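
The vector construction and cosine step can be traced with a short Python sketch that reproduces the numbers of Figure 3. The word similarity table `TOY_SIM` is a hypothetical stand-in whose values were chosen to match the figure's vectors; in the described system these scores would come from the WordNet-based word similarity measure.

```python
import math
from typing import Callable, List, Sequence


def joint_concepts(c1: Sequence[str], c2: Sequence[str]) -> List[str]:
    """Union of both concept lists, keeping first-occurrence order."""
    joint: List[str] = []
    for concept in list(c1) + list(c2):
        if concept not in joint:
            joint.append(concept)
    return joint


def sentence_vector(concepts: Sequence[str], joint: Sequence[str],
                    word_sim: Callable[[str, str], float]) -> List[float]:
    """1.0 where the concept is present, else the best word similarity."""
    vector = []
    for target in joint:
        if target in concepts:
            vector.append(1.0)
        else:
            vector.append(max((word_sim(target, c) for c in concepts),
                              default=0.0))
    return vector


def cosine(v1: Sequence[float], v2: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0


# Hypothetical word similarity scores, chosen to match Figure 3.
TOY_SIM = {
    frozenset(("impressive", "long")): 0.3,
    frozenset(("battery", "life")): 0.15,
}


def toy_word_sim(a: str, b: str) -> float:
    return TOY_SIM.get(frozenset((a, b)), 0.0)


s1 = ["battery", "camera", "impressive"]
s2 = ["camera", "has", "long", "battery", "life"]
joint = joint_concepts(s1, s2)
v1 = sentence_vector(s1, joint, toy_word_sim)
v2 = sentence_vector(s2, joint, toy_word_sim)
similarity = cosine(v1, v2)  # rounds to 0.69, as in Figure 3
```

Because the vector length equals the size of the joint concept set, the representation grows or shrinks with each sentence pair rather than with the vocabulary of the whole collection.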



3.4.2 Subtopic Clustering for Summarization 

Once all pairwise similarities are calculated, we feed the set to 
the sentence clustering module. We implemented both hierarchical 
and non-hierarchical algorithms to compare their performances. 

(1) Hierarchical Clustering 

We apply hierarchical clustering in an agglomerative (bottom-up) 
manner. Individual sentences are initialized as singleton clusters, 
and the pair of clusters with the minimum pairwise distance is 
iteratively merged. This continues until a terminating criterion is 
satisfied. The well-known pairwise cluster distances are 
complete-link, single-link, and groupwise-average. Among them, we 
employ the groupwise-average distance, as our preliminary experiments 
show that it performs more consistently. Given two different clusters 
$c_i$ and $c_j$, the groupwise-average measure is defined in terms of 
sentence similarity as follows: 

\[ sim(c_i, c_j) = \frac{1}{|c_i \cup c_j| \, (|c_i \cup c_j| - 1)} \sum_{x \in c_i \cup c_j} \ \sum_{y \in c_i \cup c_j,\ y \neq x} sim(x, y). \]
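
A minimal Python sketch of the agglomerative procedure is given below. It assumes that the groupwise-average score is the mean pairwise similarity over all distinct sentence pairs in the merged cluster, and that merging stops once a target number of clusters is reached; the function names and the toy similarity values are hypothetical.

```python
from itertools import combinations
from typing import Callable, List


def group_average_similarity(ci: List[int], cj: List[int],
                             sim: Callable[[int, int], float]) -> float:
    """Mean pairwise similarity inside the union of two clusters."""
    merged = ci + cj
    size = len(merged)
    # Each unordered pair appears twice in the double sum over x != y.
    total = sum(sim(x, y) for x, y in combinations(merged, 2))
    return 2.0 * total / (size * (size - 1))


def agglomerate(items: List[int], sim: Callable[[int, int], float],
                target: int) -> List[List[int]]:
    """Merge the most similar pair of clusters until `target` remain."""
    clusters = [[item] for item in items]
    while len(clusters) > max(target, 1):
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: group_average_similarity(
                       clusters[p[0]], clusters[p[1]], sim))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy pairwise similarities between four sentences (hypothetical values).
SCORES = {(0, 1): 0.9, (2, 3): 0.8}


def toy_sim(x: int, y: int) -> float:
    return SCORES.get((min(x, y), max(x, y)), 0.1)


result = agglomerate([0, 1, 2, 3], toy_sim, target=2)
```

Choosing the pair with the highest average similarity is equivalent to choosing the pair with the smallest distance when distance is taken as one minus similarity.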



Too many small clusters result in an excessively detailed summary 
and an over-estimation of the number of actual subtopics, while 
a few large clusters result in a summary that omits important 
information. Therefore, we adopt an algorithm proposed in [8] to 
estimate the final number of clusters. The clustering process 
terminates as soon as the number of clusters reaches this value. The 
authors of [8] first define the notion of links: if the semantic 
similarity score between two sentences is greater than a certain 
threshold, a link is posited, joining the two sentences together. 
Therefore, if we compute the similarity score for every pair of 
sentences in the collection and apply the notion of links, we obtain 
a graph whose vertices are sentences and whose edges are those links. 
The estimated number of clusters $c$, given an input of $n$ sentences 
that corresponds to a graph with $m$ connected components, is then 
defined as follows: 



\[ c = m + \left( \frac{n}{2} - m \right) \left( 1 - \frac{\log(L)}{\log(P)} \right), \tag{4} \]



where $L$ is the observed number of links. In addition, the maximum 
possible number of links $P$ is defined as follows: 

\[ P = \frac{n(n - 1)}{2}. \]
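
The estimate can be sketched in Python as follows. The sketch assumes that Equation (4) has the form $c = m + (n/2 - m)(1 - \log L / \log P)$, which interpolates between $m$ (when every possible link is present) and $n/2$ (when links are scarce); this form, the fallback used when no link exists, and all function names are assumptions made for the illustration.

```python
import math
from typing import Iterable, Set, Tuple


def count_components(n: int, links: Set[Tuple[int, int]]) -> int:
    """Number of connected components of the link graph (union-find)."""
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in links:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n)})


def estimate_num_clusters(n: int, links: Iterable[Tuple[int, int]]) -> float:
    """Estimated cluster count for n sentences and the observed links."""
    link_set = {(min(a, b), max(a, b)) for a, b in links}
    m = count_components(n, link_set)
    observed = len(link_set)        # L: observed number of links
    possible = n * (n - 1) // 2     # P: maximum possible number of links
    if observed == 0 or possible <= 1:
        # The logarithms are undefined here; fall back to m (assumption).
        return float(m)
    return m + (n / 2 - m) * (1 - math.log(observed) / math.log(possible))


fully_linked = estimate_num_clusters(4, [(0, 1), (0, 2), (0, 3),
                                          (1, 2), (1, 3), (2, 3)])
two_pairs = estimate_num_clusters(4, [(0, 1), (2, 3)])
```

In practice the real-valued estimate would be rounded to an integer before it is used as the stopping point for clustering; the rounding rule is not specified here.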



(2) Non-hierarchical Clustering 

We also implement a non-hierarchical clustering technique, the 
exchange method [27], which regards the clustering problem as 
an optimization task. The algorithm seeks to minimize an objective 
function $\Phi$ that measures the intra-cluster dissimilarity of a 
partition $P = \{C_1, C_2, \ldots, C_k\}$: 



\[ \Phi(P) = \sum_{i=1}^{k} \frac{1}{|C_i|} \sum_{x, y \in C_i,\ x \neq y} (1 - sim(x, y)). \tag{5} \]



The same estimate of the number of final clusters mentioned earlier 
is first applied to determine the size of the partition $P$. The 
algorithm then proceeds by creating an initial assignment of the 
sentences to the clusters of the partition, and looking for locally 
optimal moves ("swaps") of sentences between clusters that improve 
$\Phi$ in each iteration until convergence. Since this is a 
hill-climbing method, it is necessary to call the algorithm multiple 
times, with a random partition of the sentences into clusters each 
time. The best overall configuration is selected as the final 
clustering result. 
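
The hill-climbing loop with random restarts can be sketched as below. This is a simplified illustration, not the exchange method of [27] itself: it moves one sentence at a time on first improvement, never empties a cluster, and assumes an objective that sums 1 - sim(x, y) over the pairs inside each cluster and divides by the cluster size (the normalisation is an assumption). It also assumes k does not exceed the number of sentences.

```python
import random
from itertools import combinations
from typing import Callable, List


def objective(partition: List[List[int]],
              sim: Callable[[int, int], float]) -> float:
    """Intra-cluster dissimilarity, normalised per cluster (assumed form)."""
    total = 0.0
    for cluster in partition:
        if len(cluster) > 1:
            # Unordered pairs; counting ordered pairs would only scale
            # the value by two and not change which partition is best.
            pairs = sum(1.0 - sim(x, y) for x, y in combinations(cluster, 2))
            total += pairs / len(cluster)
    return total


def exchange_clustering(items: List[int], sim: Callable[[int, int], float],
                        k: int, restarts: int = 20,
                        seed: int = 0) -> List[List[int]]:
    rng = random.Random(seed)
    best, best_score = None, float("inf")
    for _ in range(restarts):
        # Random initial partition with no empty cluster.
        order = items[:]
        rng.shuffle(order)
        partition = [[order[i]] for i in range(k)]
        for x in order[k:]:
            partition[rng.randrange(k)].append(x)
        current = objective(partition, sim)
        improved = True
        while improved:
            improved = False
            for src in range(k):
                for x in list(partition[src]):
                    for dst in range(k):
                        if dst == src or len(partition[src]) == 1:
                            continue
                        partition[src].remove(x)
                        partition[dst].append(x)
                        candidate = objective(partition, sim)
                        if candidate < current - 1e-12:
                            current, improved = candidate, True
                            break  # keep the move, go to the next sentence
                        partition[dst].pop()      # undo the move
                        partition[src].append(x)
        if current < best_score:
            best, best_score = [c[:] for c in partition], current
    return best


# Toy pairwise similarities between four sentences (hypothetical values).
SCORES = {(0, 1): 0.9, (2, 3): 0.8}


def toy_sim(x: int, y: int) -> float:
    return SCORES.get((min(x, y), max(x, y)), 0.1)


clusters = exchange_clustering([0, 1, 2, 3], toy_sim, k=2)
```

Keeping the best of several restarts is what compensates for individual runs that stop in a local minimum.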

(3) Compact Presentation of Sentences 

This step generates and presents the resulting target summary 
shown in Figure 1(b). It considers the sentence clusters from all 
facets generated by the previous "Subtopic Clustering" component. By 
applying the sentiment analysis technique described in Section 3.4.1, 
we can determine the orientation of every sentence in a particular 
subtopic. With this information, we partition the sentences in each 
subtopic by polarity. The subsequent task is to select the most 
representative sentence for each partition. The selected sentence 
must capture as much as possible of the information present in the 
other sentences; in other words, the target sentence is the one most 
similar to all the remaining sentences. Thus, for each sentence 
$s_i$ in the corresponding positive/negative partition $P$, we define 
its representative power $Rep(s_i)$ as follows: 

\[ Rep(s_i) = \sum_{s_j \in P,\ j \neq i} sim(s_i, s_j). \tag{6} \]

The sentence with the highest representative power is selected as 
the output sentence shown to users. Finally, for the user's quick 
reference, we also supplement the selected sentence with the number 
of sentences sharing the same point of view. 
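
The selection step can be sketched as follows. The sketch assumes the representative power of a sentence is the sum of its similarities to every other sentence in the same polarity partition; the word-overlap (Jaccard) similarity used here is only a stand-in for the sentence similarity measure described earlier, and the example sentences are invented.

```python
from typing import Callable, List, Tuple


def representative_power(index: int, partition: List[str],
                         sim: Callable[[str, str], float]) -> float:
    """Sum of similarities between one sentence and all the others."""
    return sum(sim(partition[index], other)
               for j, other in enumerate(partition) if j != index)


def select_representative(partition: List[str],
                          sim: Callable[[str, str], float]) -> Tuple[str, int]:
    """Return the most representative sentence and the partition size."""
    best = max(range(len(partition)),
               key=lambda i: representative_power(i, partition, sim))
    return partition[best], len(partition)


def jaccard(a: str, b: str) -> float:
    """Toy word-overlap similarity (stand-in, not the paper's measure)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)


positive_partition = [
    "battery lasts long",
    "battery life is long",
    "the battery is great",
]
summary_sentence, support = select_representative(positive_partition, jaccard)
```

The second returned value is the count that would accompany the selected sentence in the summary, indicating how many sentences share the same point of view.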

4. EXPERIMENTS 

4.1 Experimental Data and Measure 

4.1.1 Experimental Data 

In our experiments, we use publicly available sets of reviews for 
three products (camera, phone, and DVD) [11]. This dataset is 
directly compatible with our "product facet identification" 
component, since we evaluate our implemented version of Hu and Liu's 
system and our proposed system in exactly the same way as in [11]. 
In addition, to evaluate the summarization component, we prepare 
our own labeled data, which consists of sentences partitioned into 
subtopics for a set of the 22 most frequent facets extracted from 
those three products. The inter-annotator agreement between two 
annotators was 85%. The final portion of the data on which both 
annotators reached consensus, and which we use for evaluation, was 
90%. 

4.1.2 Evaluation Measure for Product Facet Identification 

We use the standard precision and recall measures to evaluate 
the performance of our product facet identification component. Let 
$MF$ and $SF$ be the manually extracted facets and the system 
extracted facets, respectively. Precision ($Pre$) and recall ($Rec$) 
are defined as follows: 

\[ Pre = \frac{|\{MF\} \cap \{SF\}|}{|\{SF\}|}, \qquad Rec = \frac{|\{MF\} \cap \{SF\}|}{|\{MF\}|}. \]
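
As a small illustration, precision and recall over facet sets can be computed as below. The facet names are invented for the example, and returning 0.0 for an empty set is an assumption.

```python
from typing import Set, Tuple


def precision_recall(manual: Set[str], system: Set[str]) -> Tuple[float, float]:
    """Precision and recall of system facets against manual facets."""
    overlap = len(manual & system)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(manual) if manual else 0.0
    return precision, recall


manual_facets = {"battery", "lens", "picture", "flash"}
system_facets = {"battery", "lens", "price"}
pre, rec = precision_recall(manual_facets, system_facets)
```

Here two of the three system facets are correct, and two of the four manual facets are recovered.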



4.1.3 Evaluation Measure for Summarization 

In order to evaluate the performance of our summarization 
component, we use purity, inverse purity, and F-measure (the harmonic 
mean of purity and inverse purity), which are widely used clustering 
measures [10]. 

Purity is related to the precision measure. This measure focuses 
on the frequency of the most common category in each cluster, and 
rewards clustering solutions that introduce less noise in each 
cluster. Let $C$, $L$, and $n$ be the set of automatic clusters to be 
evaluated, the set of manually annotated clusters, and the number 
of sentences to be clustered, respectively. Purity is computed by 
taking the weighted average of the maximum precision values: 

\[ Purity = \sum_{i} \frac{|C_i|}{n} \max_{j} Precision(C_i, L_j), \]

where the precision of an automatic cluster $C_i$ for a given manual 
subtopic $L_j$ is defined as: 

\[ Precision(C_i, L_j) = \frac{|C_i \cap L_j|}{|C_i|}. \]

Inverse Purity focuses on the cluster with maximum recall for 
each category, rewarding clustering solutions that gather more 
elements of each category in a corresponding single cluster. Inverse 
Purity ($IPurity$) is defined as follows: 

\[ IPurity = \sum_{i} \frac{|L_i|}{n} \max_{j} Precision(L_i, C_j). \]



The F-measure $F_\alpha$, the harmonic mean of purity and inverse 
purity, is defined as follows: 

\[ F_\alpha = \frac{1}{\alpha \frac{1}{Purity} + (1 - \alpha) \frac{1}{Inverse\ Purity}}. \]
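
The three clustering measures can be sketched together in Python. Clusters and manual subtopics are represented as sets of sentence identifiers; the small example data are invented, and the sketch assumes no cluster is empty.

```python
from typing import List, Set


def precision(a: Set[int], b: Set[int]) -> float:
    """Fraction of the elements of a that also belong to b."""
    return len(a & b) / len(a)


def purity(clusters: List[Set[int]], labels: List[Set[int]], n: int) -> float:
    """Size-weighted average of each cluster's best precision."""
    return sum(len(c) / n * max(precision(c, l) for l in labels)
               for c in clusters)


def inverse_purity(clusters: List[Set[int]], labels: List[Set[int]],
                   n: int) -> float:
    """Purity with the roles of clusters and manual subtopics swapped."""
    return purity(labels, clusters, n)


def f_measure(p: float, ip: float, alpha: float = 0.5) -> float:
    """Weighted harmonic mean of purity and inverse purity."""
    return 1.0 / (alpha / p + (1.0 - alpha) / ip)


automatic = [{1, 2, 3}, {4, 5}]
manual = [{1, 2}, {3, 4, 5}]
p = purity(automatic, manual, n=5)
ip = inverse_purity(automatic, manual, n=5)
f = f_measure(p, ip)
```

With alpha set to 0.5 the measure reduces to the usual balanced harmonic mean; for instance, a purity of 0.864 and an inverse purity of 0.591 give approximately 0.702.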



Table 1: Performance of the product facet identification component in Hu and Liu [11]. 

          Number of manually   Association mining    Post processing       Infrequent facet
Data      extracted facets     Recall    Precision   Recall    Precision   Recall    Precision
Camera    79                   0.671     0.552       0.658     0.825       0.822     0.747
Phone     67                   0.731     0.563       0.716     0.828       0.761     0.718
DVD       49                   0.754     0.531       0.754     0.765       0.797     0.793
Average   65                   0.719     0.549       0.709     0.806       0.793     0.753


Table 2: Performance of our product facet identification component, comprising Hu and Liu's system [11] plus the use of syntactic 
roles. 

          Number of manually   Association mining    Post processing       Infrequent facet
Data      extracted facets     Recall    Precision   Recall    Precision   Recall    Precision
Camera    79                   0.671     0.646       0.658     0.894       0.822     0.842
Phone     67                   0.731     0.648       0.716     0.903       0.761     0.769
DVD       49                   0.754     0.610       0.754     0.818       0.797     0.867
Average   65                   0.719     0.634       0.709     0.872       0.793     0.826


In our evaluation, we set the value of $\alpha$ to 0.5, and denote the 
resulting measure as $F_1$ (rather than $F_{0.5}$, to follow standard 
$F_1$ semantics) in the following. 

4.2 Experimental Results 

4.2.1 Product Facet Identification 

Tables 1 and 2 show the results of our implemented version of 
Hu and Liu's system [11], and the results when we integrate the 
syntactic role heuristic into their system, respectively. Table 1 
shows that our reimplementation can achieve the results reported in 
[11]. We observe that the system identifies most of the common 
facets, such as battery, picture, and lens for camera; signal and 
headset for phone; and remote control and format for DVD player. We 
observe an improvement in precision in Table 2, as most of the noisy 
results have been filtered out using syntactic role information. For 
example, in the Camera dataset, the precision of infrequent facet 
extraction rises from 0.747 in Table 1 to 0.842 in Table 2, an 
improvement of 0.095. However, we observe no improvement in recall, 
since the syntactic role heuristic is a filter that eliminates noise 
rather than adding new results. 

4.2.2 Summarization 

Table 3 shows the results for the summarization component. Each 
facet contains a different number of subtopics, sometimes as few as 
one. 

For example, the Price facet in the DVD product actually has 
no subtopic, resulting in just one manually defined cluster. The 
reason is that users only express opinions toward the two extremes 
of whether the DVD player is expensive or affordable (note that a 
subtopic is independent of sentiment information). Similarly, for 
the Format facet in the DVD product, users only discuss whether 
the DVD player can play all video formats or not. Thus, the number 
of manually defined clusters is also one. 

On the other hand, some facets have many subtopics (e.g., 
Lens in Camera (7 subtopics), LCD in Camera (6 subtopics), 
etc.). This is because they exhibit many different properties (the 
size, ease of use, price, etc. for the lens, or the resolution, 
material, color, etc. for the LCD), and users discuss many angles of 
these subtopics. We also observe that the common facet Service 
produces more subtopics in Phone (5 subtopics) than in DVD 
(1 subtopic). This is because Phone users generally tend to compare 
many different service providers, while DVD users only complain 
about the service of the particular manufacturer in the review, with 
almost no comparison to its competitors. 



Interestingly, the number of subtopics varies not only from facet 
to facet, but also from product to product. In our data, the product 
Camera shows the greatest number, about 5 subtopics per facet on 
average, while DVD contains only 2 subtopics per facet on average. 
This can be explained by the above observation: the facets that 
belong to Camera usually have richer properties to comment on than 
those that belong to DVD. This also affects the performance of our 
clustering algorithm. 

We compare the performance of our algorithms with a baseline, 
which randomly assigns sentences to clusters. Note that the number 
of clusters is determined by the estimation in Equation (4), before 
the clustering process starts. The estimated cluster number is fed 
to the random algorithm as well (for comparison). We record the 
average performance of the random clustering baseline over 200 
trials. For the non-hierarchical clustering approach, we also 
execute the algorithm 200 times, in order to reduce the effect of 
runs in which the algorithm is trapped in a local minimum, and we 
record the run that best minimizes the objective function in 
Equation (5). In contrast, we need to execute the hierarchical 
clustering algorithm only once, as it is deterministic given the 
estimated number of final clusters. 

The last row in each product data in Table 3 shows the relative 
performance of the proposed algorithms with respect to the baseline 
of random clustering. According to Table 3, our two proposed 
clustering algorithms always outperform the baseline of random 
clustering by a significant amount. 

On the other hand, we observe small differences in average 
performance between the hierarchical approach and the 
non-hierarchical one. The non-hierarchical approach tends to perform 
better when the number of subtopics is large (e.g., Lens in Camera, 
Service in Phone), but worse when the number of subtopics is small 
(e.g., Service in DVD). An analysis shows that when more subtopics 
exist, the non-hierarchical approach has a better chance of reaching 
the global solution, as every move/swap operation it suggests 
affects the objective function. However, when there are few 
subtopics, its move/swap operation is not as effective and the 
algorithm terminates quickly, while the hierarchical approach using 
average-link distance keeps a better balance between the clusters. 

We have shown that both hierarchical and non-hierarchical 
clustering outperform the baseline of random clustering on all three 
products: Camera, Phone, and DVD. However, we observe that the 
margin by which they outperform the baseline tends to decrease as 
the number of subtopics decreases. In most cases, with a reliable 
sentence similarity measurement, the estimated number of final 
clusters is indeed very close to the number of annotated subtopics. 
When there are only a few subtopics, the estimated number of final 
clusters is also small. Under this condition, each sentence assigned 
by the random clustering algorithm has a higher chance of landing in 
the correct cluster. As a result, we do not observe a large 
improvement of our proposed clustering algorithms over the random 
algorithm. On the other hand, when there are many subtopics, the 
estimated number of final clusters also becomes larger, which is why 
random assignment has little success in placing sentences in the 
correct clusters. 

Table 3: Performance of the Summarization component. 
(Clusters = number of manually defined clusters; I-Purity = inverse purity.) 

                              Hierarchical clustering       Non-hierarchical clustering   Random clustering
Data    Facet       Clusters  Purity    I-Purity  F1        Purity    I-Purity  F1        Purity    I-Purity  F1
Camera  Battery     4         0.864     0.591     0.702     0.864     0.636     0.733     0.864     0.455     0.596
        Memory      3         0.643     1.000     0.783     0.643     0.786     0.707     0.500     0.643     0.563
        Flash       4         0.556     0.722     0.628     0.667     0.722     0.693     0.500     0.611     0.550
        LCD         6         0.478     0.826     0.606     0.565     1.000     0.722     0.348     0.739     0.473
        Lens        7         0.792     1.000     0.884     0.792     1.000     0.884     0.500     0.667     0.571
        Megapixels  5         0.621     0.483     0.543     0.724     0.552     0.626     0.552     0.414     0.473
        Mode        6         0.813     1.000     0.897     0.813     1.000     0.897     0.500     0.625     0.556
        Shutter     6         0.643     0.929     0.760     0.643     0.929     0.760     0.429     0.786     0.555
        Average     5.13      0.676     0.819     0.725     0.714     0.828     0.753     0.524     0.617     0.542
Phone   Battery     3         0.824     0.765     0.793     0.765     0.706     0.734     0.706     0.588     0.642
        Camera      3         0.727     0.636     0.679     0.727     0.636     0.679     0.727     0.545     0.623
        Headset     4         0.467     0.733     0.570     0.400     0.600     0.480     0.400     0.667     0.500
        Radio       3         0.737     0.737     0.737     0.737     0.737     0.737     0.737     0.579     0.648
        Service     5         0.438     0.875     0.583     0.563     1.000     0.720     0.375     0.625     0.469
        Signal      3         0.824     0.941     0.878     0.824     0.765     0.793     0.824     0.588     0.686
        Size        3         0.760     0.680     0.718     0.920     0.680     0.782     0.720     0.520     0.604
        Speaker     4         0.684     0.895     0.775     0.684     0.789     0.733     0.684     0.632     0.657
        Average     3.50      0.682     0.783     0.717     0.702     0.739     0.722     0.647     0.593     0.604
DVD     Price       1         1.000     0.714     0.833     1.000     0.762     0.865     1.000     0.524     0.688
        Remote      4         0.625     0.750     0.682     0.563     0.750     0.643     0.500     0.688     0.579
        Format      1         1.000     0.714     0.833     1.000     0.571     0.727     1.000     0.500     0.667
        Design      1         1.000     1.000     1.000     1.000     1.000     1.000     1.000     1.000     1.000
        Service     1         1.000     0.739     0.850     1.000     0.522     0.686     1.000     0.522     0.686
        Picture     4         0.800     0.850     0.824     0.800     0.850     0.824     0.450     0.500     0.474
        Average     2.00      0.904     0.795     0.837     0.894     0.743     0.791     0.825     0.622     0.682

5. CONCLUSION 

In this work, we have proposed a system that can summarize 
product reviews. Existing systems for product review summarization 
usually construct a facet-based summary, which aggregates the 
sentiment information that belongs to each facet. We have 
implemented a similar method as the first component in our system, 
and we improve this component's performance by applying syntactic 
role information within a sentence. 

More importantly, since we showed the existence of underlying 
subtopics within facets, we introduced a second task that summarizes 
the reviews from a deeper perspective. Our summarization component 
proceeds by grouping sentences about the same subtopic together, and 
provides a compact summary with sentiment information to the users. 
We introduced a clustering approach to solve the subtopic problem. 
Nevertheless, the approach is highly dependent on the semantic 
similarity between words as well as between sentences, which is a 
problem that we cannot completely solve without some form of manual 
input. In addition, we do not utilize deep semantic information in 
determining the similarities between sentences. If we were able to 
analyze such semantics, our system might achieve better performance. 

Several extensions of our current system are possible. Brand 
names that belong to a particular product class (e.g., Nikon, Canon 
(Camera); Pioneer (DVD); iPod (Music Player), etc.), or 
product/manufacturer names of accessories that go together with the 
main product (e.g., Kingston (compact flash card for camera), Nvidia 
(graphics card for computer), etc.), are all treated as genuine 
facets in the annotation of the dataset. However, in most cases, they 
appear together with some other facets when a comparison is made 
between that product and its competitors ("My Canon camera has 
longer battery life than Nikon"). These general/proper entities are 
not very useful for summarization and should be excluded. One item 
of future work is to build a module that recognizes these proper 
names and excludes them. Comparative summarization systems would 
benefit directly from our system, as they would then be able to 
compare product facets at a more fine-grained level. Alternatively, 
as our summarization system only generates extractive summaries, it 
might be more desirable to have a system that can reformulate the 
output sentences from our subtopic clustering and provide users with 
the reformulated content. Last but not least, more useful metadata 
about the reviews, such as titles, users' ratings, and so on, can 
also be incorporated into the summarization system. 